feat(libp2p): shared TCP listeners and AutoTLS.AutoWSS #10565
Conversation
go.mod (Outdated)
@@ -274,3 +274,5 @@ require (
 	gopkg.in/yaml.v3 v3.0.1 // indirect
 	lukechampine.com/blake3 v1.3.0 // indirect
 )
+
+replace github.com/libp2p/go-libp2p => github.com/libp2p/go-libp2p v0.37.1-0.20241202220543-9024f8e8c86e
Got this to a state that is as close to a go-libp2p release as possible. Filing this here so I won't forget:
- remove `replace` from all go.mod files and switch to a go-libp2p release
This adds an AutoTLS.AutoWSS flag that is set to true by default. It checks whether Addresses.Swarm contains an explicit /ws listener, and if none is found, it appends one for every /tcp listener. This way existing TCP ports are reused without any extra configuration, but we don't break users who already have a custom/explicit /ws listener. I also moved the logger around, to include the Addresses.Swarm inspection results in the `autotls` logger.
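For illustration only, here is a minimal sketch of that append logic, assuming go-multiaddr for parsing. `appendWSIfMissing` and the sample addresses are hypothetical names for this example, not the actual Kubo code:

```go
package main

import (
	"fmt"
	"strings"

	ma "github.com/multiformats/go-multiaddr"
)

// appendWSIfMissing adds a /ws listener for every plain /tcp listener,
// unless the swarm addresses already contain an explicit /ws (or /wss) one.
func appendWSIfMissing(swarm []string) []string {
	for _, s := range swarm {
		if strings.Contains(s, "/ws") { // matches /ws and /wss
			return swarm // user configured WebSockets explicitly, leave as-is
		}
	}
	out := append([]string{}, swarm...)
	for _, s := range swarm {
		addr, err := ma.NewMultiaddr(s)
		if err != nil {
			continue // skip malformed entries
		}
		// only TCP listeners get a WebSocket sibling; QUIC/UDP are untouched
		if _, err := addr.ValueForProtocol(ma.P_TCP); err == nil {
			out = append(out, s+"/ws")
		}
	}
	return out
}

func main() {
	fmt.Println(appendWSIfMissing([]string{
		"/ip4/0.0.0.0/tcp/4001",
		"/ip4/0.0.0.0/udp/4001/quic-v1",
	}))
	// [/ip4/0.0.0.0/tcp/4001 /ip4/0.0.0.0/udp/4001/quic-v1 /ip4/0.0.0.0/tcp/4001/ws]
}
```

The early return is what keeps existing explicit /ws listeners intact: AutoWSS only kicks in when the user has not configured WebSockets themselves.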
Added the AutoWSS appending described above. It simplifies things by removing the need for manually touching Addresses.Swarm. I plan to ship this in RC1 as soon as we have a new go-libp2p release.
There seems to be a regression between 9b887ea (this PR) and 433444b (latest master), caused by the TCP reuse logic in the go-libp2p bump here.
I did `curl localhost:5001/debug/pprof/profile > ipfs.cpuprof` and the CPU profile seems to confirm the source is TCP reuse from libp2p/go-libp2p#2984 (cc @aschmahmann @MarcoPolo).
How to reproduce
- Start `ipfs daemon`
- Open http://127.0.0.1:5001/webui#/peers and wait a minute:
  - this will trigger connections to peers to fetch any missing blocks needed for geoip data (if not cached locally) – acting like an accelerator for any underlying bugs in TCP port reuse
- 💢 staging-2024-12-04-9b887ea (go-libp2p master, v0.37.1-0.20241202220543-9024f8e8c86e) will grow CPU use once the ipfs-webui Peers screen is opened; CPU peaks within 1 minute.
- 💚 master-2024-12-03-433444b (go-libp2p v0.37.2) does not trigger high CPU load.
Here are docker images for repro convenience:
docker run --rm -it --net=host ipfs/kubo:staging-2024-12-04-9b887ea
docker run --rm -it --net=host ipfs/kubo:master-2024-12-03-433444b
libp2p/go-libp2p#3080 seems to fix the CPU spin.
Tested in 2adb2f1 – need to investigate why CI fails.
I am debugging sharness with `make -O -j 1` because CI runs with `-j 10` and logs are mangled.
Getting bizarre. Tests with the same go-libp2p and boxo versions pass in #10631.
The only go.mod diff between a436f4e (this PR, CI failing due to timeout after 20m) and 070e6ae (#10631, CI green, sharness run takes 5 minutes) is:
- github.com/ipshipyard/p2p-forge v0.1.0
+ github.com/ipshipyard/p2p-forge v0.2.0
Other than that, we don't change anything in the default config other than enabling shared TCP.
Will disable it (b5cfd6d) and re-run, to see if that causes an issue somewhere.
Still looking for the cause, but I was able to reproduce locally by using the same settings as CI:
$ export TEST_DOCKER=1
$ export TEST_PLUGIN=0
$ export TEST_FUSE=0
$ export TEST_VERBOSE=1
$ export TEST_JUNIT=1
$ export TEST_EXPENSIVE=1
$ export IPFS_CHECK_RCMGR_DEFAULTS=1
$ export CONTINUE_ON_S_FAILURE=1
$ time make -O -j 10 test_sharness coverage/sharness_tests.coverprofile test/sharness/test-results/sharness.xml
# [...]
ok 1717 - add a few entries to big_dir/ to retrigger sharding
expecting success:
kill -0 $IPFS_PID
ok 1718 - 'ipfs daemon' is still running
expecting success:
test_kill_repeat_10_sec $IPFS_PID
ok 1719 - 'ipfs daemon' can be killed
# passed all 1719 test(s)
1..1719
# hangs..
It hung for a few minutes and finished eventually, but took 3x longer than a regular run (15m vs 5m).
On re-run it hung for 20m and I had to kill it, so there is some factor of randomness to the hang.
Looking at the process tree spawned by make, the t0250-files-api.sh, t0260-sharding.sh, t0181-private-network.sh, t0182-circuit-relay.sh, and t0320-pubsub.sh tests seem to hang the longest, in semi-random order.
Will resume tomorrow unless someone else has any ideas.
In the meantime, I've re-run https://github.com/ipfs/kubo/actions/runs/12420930073 to double-confirm that disabling libp2p.ShareTCPListener() fixes sharness.
OK, running sequentially (#10565 (comment)) found the problem:
- Enabling libp2p.ShareTCPListener() breaks PNET and ./t0181-private-network.sh hangs – kinda makes sense, PNET is a TCP-only feature of libp2p and we are messing with TCP here.
I'll now confirm this is the only failing test, and if the others pass, I will disable TCP sharing when PNET is enabled for now.
I'll also see if we can set a timeout per test suite / unit, to avoid this debugging horror in the future.
Good news: PNET was the only reason sharness failed. Disabled port sharing when PNET is enabled (bb87df3) and CI is green again.
I'm going forward with Kubo 0.33.0-rc1.
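For readers following along, here is a rough sketch of that gating, assuming go-libp2p option plumbing; it is not the exact change in bb87df3, and `buildOpts` is a hypothetical helper:

```go
package main

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/pnet"
)

// buildOpts enables shared TCP listeners only when no private network
// pre-shared key is configured, since PNET is a TCP-only libp2p feature
// and the two currently don't mix.
func buildOpts(psk pnet.PSK) []libp2p.Option {
	var opts []libp2p.Option
	if psk != nil {
		// private network: leave TCP port sharing off for now
		opts = append(opts, libp2p.PrivateNetwork(psk))
	} else {
		opts = append(opts, libp2p.ShareTCPListener())
	}
	return opts
}

func main() {
	h, err := libp2p.New(buildOpts(nil)...) // nil PSK: public network, sharing on
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```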
testing fix from libp2p/go-libp2p#3080
switching to p2p-forge which uses go-libp2p@master and removes the dependency conflict caused by the patch tag not being part of the main branch
main commit with ipshipyard/p2p-forge#23
running again with 1 line diff to produce clear example
once again, after master changes, for good measure (this commit should make sharness pass)
this re-enables TCP muxing (if sharness fails due to timeout, we have a bug related to libp2p.ShareTCPListener() somehow)
hoping to reproduce circuit-relay.sh repo.lock error found locally
https://github.com/ipfs/kubo/actions/runs/12433829774/job/34716226970?pr=10565#step:6:40911 timed out after 20m, we need a bigger window
let's see if PNET fix was enough
the average successful run is <9 minutes, no need to wait for 20: https://github.com/ipfs/kubo/actions/workflows/sharness.yml?query=is%3Asuccess
Sharness issues are resolved (#10565 (comment)) and docs are updated to include LIBP2P_TCP_MUX=false as an escape hatch for people playing with this feature in Kubo 0.33.0-rc1, including our staging/cluster tests.
I'm going to merge as soon as CI finishes and start the Kubo 0.33.0-rc1 dance.
Will post progress in #10580 (comment)
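A hedged sketch of how such an environment-variable escape hatch could be wired; the actual Kubo wiring may differ, and `tcpMuxEnabled` is a made-up helper for this example:

```go
package main

import (
	"os"

	"github.com/libp2p/go-libp2p"
)

// tcpMuxEnabled treats anything other than an explicit "false" as the
// default (shared TCP listeners stay enabled).
func tcpMuxEnabled() bool {
	return os.Getenv("LIBP2P_TCP_MUX") != "false"
}

func main() {
	var opts []libp2p.Option
	if tcpMuxEnabled() {
		opts = append(opts, libp2p.ShareTCPListener())
	}
	h, err := libp2p.New(opts...)
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```

So, for example, starting the daemon with `LIBP2P_TCP_MUX=false ipfs daemon` would fall back to separate TCP listeners while testing the RC.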
Part of #10560, relies on libp2p/go-libp2p#2984
lidel's TODOs
- remove `replace` from all go.mod files and switch to a go-libp2p release
- append a `/ws` listener to every `/tcp` if no manual `/ws` was defined (`AutoTLS` with `AutoWSS`)
- staging-2024-12-04-9b887ea is capping CPU once the ipfs-webui Peers screen is opened, but master-2024-12-03-433444b does not – regression in boxo or go-libp2p bumped in this PR?